Report the similarity between two structures
Use the toolkit's preferred comparison method to compare two different molecules for similarity. The result must be 0.0 if the molecules are not at all similar and 1.0 if they are completely similar.

A common task in cheminformatics is to find target structures in a data set which are similar to a query structure. The word "similar" is ill-defined. What we have instead are well-defined measures which are hopefully well correlated with what a chemist would call similar. Each toolkit has different methods for doing this. Some use hashed fingerprints and Tanimoto; others use feature keys. OEChem's preferred comparison is with LINGOS, which is based on the uninterpreted SMILES. Some toolkits might even use shape descriptors.

Implementation

Report the similarity between "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O" (PubChem CID 1548943) and "COC1=C(C=CC(=C1)C=O)O" (PubChem CID 1183).

CDK/Groovy

Save as calcTanimoto.groovy and run with:

    groovy calcTanimoto.groovy

    @GrabResolver(
      name='idea',
      root='http://ambit.uni-plovdiv.bg:8083/nexus/content/repositories/thirdparty/'
    )
    @Grapes([
      @Grab(group='org.openscience.cdk', module='cdk-fingerprint', version='1.4.11'),
      @Grab(group='org.openscience.cdk', module='cdk-silent', version='1.4.11')
    ])
    import org.openscience.cdk.fingerprint.*;
    import org.openscience.cdk.smiles.*;
    import org.openscience.cdk.silent.*;
    import org.openscience.cdk.similarity.*;

    // Parse both SMILES strings into CDK molecules
    smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance());
    smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"
    smiles2 = "COC1=C(C=CC(=C1)C=O)O"
    mol1 = smilesParser.parseSmiles(smiles1)
    mol2 = smilesParser.parseSmiles(smiles2)

    // Compute a fingerprint for each molecule and compare them with Tanimoto
    fingerprinter = new HybridizationFingerprinter()
    bitset1 = fingerprinter.getFingerprint(mol1)
    bitset2 = fingerprinter.getFingerprint(mol2)
    tanimoto = Tanimoto.calculate(bitset1, bitset2)
    println "Tanimoto: $tanimoto"

Indigo/Python

This example calculates the similarity based on two kinds of fingerprints: similarity fingerprints and substructure fingerprints.
Similarity fingerprints are shorter, while substructure fingerprints are more descriptive. For each of the two kinds of fingerprints, both Tanimoto and Tversky similarity values are written to standard output.

    from indigo import *

    indigo = Indigo()
    m1 = indigo.loadMolecule("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
    m2 = indigo.loadMolecule("COC1=C(C=CC(=C1)C=O)O")

    # Aromatize molecules because second molecule is not in aromatic form
    m1.aromatize()
    m2.aromatize()

    # Calculate similarity between "similarity" fingerprints
    print("Similarity fingerprints:")
    fp1 = m1.fingerprint("sim")
    fp2 = m2.fingerprint("sim")
    print(" Tanimoto: %s" % (indigo.similarity(fp1, fp2, "tanimoto")))
    print(" Tversky: %s" % (indigo.similarity(fp1, fp2, "tversky")))

    # Calculate similarity between "substructure" fingerprints
    print("Substructure fingerprints:")
    fp1 = m1.fingerprint("sub")
    fp2 = m2.fingerprint("sub")
    print(" Tanimoto: %s" % (indigo.similarity(fp1, fp2, "tanimoto")))
    print(" Tversky: %s" % (indigo.similarity(fp1, fp2, "tversky")))

Indigo/C++

Same calculations as in Indigo/Python, but using the Indigo core C++ library.

    #include <stdio.h>   // printf, fprintf
    #include <string.h>  // memset

    #include "base_cpp/scanner.h"
    #include "molecule/molecule.h"
    #include "molecule/smiles_loader.h"
    #include "molecule/molecule_arom.h"
    #include "molecule/molecule_fingerprint.h"
    #include "base_c/bitarray.h"

    // Build fingerprints for both molecules and print the two similarity values
    void _Fingerprints (Molecule &mol1, Molecule &mol2, MoleculeFingerprintParameters &params)
    {
       MoleculeFingerprintBuilder builder1(mol1, params);
       MoleculeFingerprintBuilder builder2(mol2, params);
       int fpsize = params.fingerprintSize();

       builder1.process();
       builder2.process();

       int ones1 = bitGetOnesCount(builder1.get(), fpsize);
       int ones2 = bitGetOnesCount(builder2.get(), fpsize);
       int common_ones = bitCommonOnes(builder1.get(), builder2.get(), fpsize);

       float tanimoto = 0, tversky = 0;

       if (common_ones > 0)
       {
          tanimoto = (float)common_ones / (ones1 + ones2 - common_ones);
          // Tversky with both weights equal to 0.5, i.e. the Dice coefficient
          tversky = 2.f * common_ones / (ones1 + ones2);
       }

       printf(" Tanimoto: %f\n Tversky: %f\n", tanimoto, tversky);
    }

    int main (void)
    {
       const char *smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O";
       const char *smiles2 = "COC1=C(C=CC(=C1)C=O)O";
       Molecule mol1, mol2;

       try
       {
          BufferScanner scanner1(smiles1);
          SmilesLoader loader1(scanner1);
          loader1.loadMolecule(mol1, false);
          mol1.calcImplicitHydrogens(true);
          MoleculeAromatizer::aromatizeBonds(mol1);

          BufferScanner scanner2(smiles2);
          SmilesLoader loader2(scanner2);
          loader2.loadMolecule(mol2, false);
          mol2.calcImplicitHydrogens(true);
          MoleculeAromatizer::aromatizeBonds(mol2);

          MoleculeFingerprintParameters params1, params2;
          memset(&params1, 0, sizeof(params1));
          memset(&params2, 0, sizeof(params2));

          // 64 bytes -- default value in Bingo for similarity search
          params1.sim_qwords = 8;
          // 200 bytes -- default value in Bingo for substructure search
          params2.ord_qwords = 25;

          printf("Similarity fingerprints:\n");
          _Fingerprints(mol1, mol2, params1);
          printf("Substructure fingerprints:\n");
          _Fingerprints(mol1, mol2, params2);
       }
       catch (Exception &e)
       {
          fprintf(stderr, "error: %s\n", e.message());
          return -1;
       }
       return 0;
    }

Instructions:

#Unpack the 'graph' and 'molecule' projects into some folder
#Create a 'utils' folder next to them
#Paste the above code into the utils/similarity.cpp file
#Compile the file using the following commands:
    $ cd graph; make CONF=Release32; cd ..
    $ cd molecule; make CONF=Release32; cd ..
    $ cd utils
    $ gcc similarity.cpp -o similarity -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++
#Run the program like this:
    $ ./similarity

Expected result:

    Similarity fingerprints:
     Tanimoto: 0.448276
     Tversky: 0.619048
    Substructure fingerprints:
     Tanimoto: 0.436823
     Tversky: 0.608040

OpenBabel/Pybel

    # Open Babel 2.x module name; with Open Babel 3.x use "from openbabel import pybel"
    import pybel

    mol1 = pybel.readstring("smi", "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
    mol2 = pybel.readstring("smi", "COC1=C(C=CC(=C1)C=O)O")
    # The | operator on two Pybel fingerprints gives their Tanimoto coefficient
    print(mol1.calcfp() | mol2.calcfp())

This reports a similarity of 0.360465116279.

OpenBabel/Rubabel

    require 'rubabel'

    (mol1, mol2) = %w{CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O COC1=C(C=CC(=C1)C=O)O}.map {|sml| Rubabel[sml] }
    puts mol1.tanimoto(mol2)

OpenEye/Python

I think LINGOS is the preferred similarity measure in OEChem, but it gives a much lower similarity value for these two structures than I expected, so I also show how to use its path fingerprints.

    from openeye.oechem import *
    from openeye.oegraphsim import *

    # LINGOS works directly on the SMILES strings
    sim = OELingoSim("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
    print("LINGOS similarity: %s" % sim.Similarity("COC1=C(C=CC(=C1)C=O)O"))

    # Path-based fingerprint for comparison
    def make_fp(smiles):
        mol = OEGraphMol()
        OEParseSmiles(mol, smiles)
        fp = OEFingerPrint()
        OEMakePathFP(fp, mol)
        return fp

    fp1 = make_fp("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
    fp2 = make_fp("COC1=C(C=CC(=C1)C=O)O")
    print("Hash similarity: %s" % OETanimoto(fp1, fp2))

The output is:

    LINGOS similarity: 0.0425531901419
    Hash similarity: 0.374125868082

RDKit/Python

    from rdkit import Chem, DataStructs

    mol1 = Chem.MolFromSmiles("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
    mol2 = Chem.MolFromSmiles("COC1=C(C=CC(=C1)C=O)O")

    # the default fingerprint is path-based:
    fp1 = Chem.RDKFingerprint(mol1)
    fp2 = Chem.RDKFingerprint(mol2)
    print("RDK fingerprint: %s" % DataStructs.TanimotoSimilarity(fp1, fp2))

    # the Morgan fingerprint (similar to ECFP) is also useful:
    from rdkit.Chem import rdMolDescriptors
    mfp1 = rdMolDescriptors.GetMorganFingerprint(mol1, 2)
    mfp2 = rdMolDescriptors.GetMorganFingerprint(mol2, 2)
    print("Morgan fingerprint: %s" % DataStructs.DiceSimilarity(mfp1, mfp2))

The output is:

    RDK fingerprint: 0.471502590674
    Morgan fingerprint: 0.505494505495

Cactvs/Tcl

    puts [expr [prop compare E_SCREEN \
        [ens get [ens create "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"] E_SCREEN] \
        [ens get [ens create "COC1=C(C=CC(=C1)C=O)O"] E_SCREEN] tanimoto]/100.0]

This computes the Tanimoto similarity on the standard pattern-based fingerprint E_SCREEN (the result is 0.68). The toolkit supports various other similarity measures (Cosine, Dice, Hamann, Tversky, Kulczynski, Pearson, Russell-Rao, Simpson, Yule) and alternative fingerprints (both fragment- and path-based).

There is also no need to look up the SMILES strings; we can work directly with PubChem CIDs (or SIDs):

    puts [expr [prop compare E_SCREEN [ens get [ens create 1548943] E_SCREEN] \
        [ens get [ens create 1183] E_SCREEN] tanimoto]/100.0]

The example above uses the fingerprints delivered as part of the PubChem structure data.
Since PubChem uses a longer fingerprint than the default, the result is slightly different (0.7). To arrive at identical results, either add a prop setparam E_SCREEN extended 2 command to the first example, or implicitly force re-computation of the fingerprint bits by specifying a computation parameter in the second example, as in

    puts [expr [prop compare E_SCREEN \
        [ens get 1548943 E_SCREEN {} {extended 0}] \
        [ens get 1183 E_SCREEN {} {extended 0}] tanimoto]/100.0]

Here we also further simplify the ensemble creation by instantiating a transient structure directly from the PubChem CID.

Cactvs/Python

And here again are the equivalent Python solutions.

With object creation from SMILES:

    print(Prop.Compare('E_SCREEN', Ens('CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O').E_SCREEN,
                       Ens('COC1=C(C=CC(=C1)C=O)O').E_SCREEN, 'tanimoto') / 100.0)

With structure creation from CID:

    print(Prop.Compare('E_SCREEN', Ens(1548943).E_SCREEN,
                       Ens(1183).E_SCREEN, 'tanimoto') / 100.0)

With transient structure objects and computation parameter check:

    print(Prop.Compare('E_SCREEN',
                       Ens.Get(1548943, 'E_SCREEN', parameters={'extended': 0}),
                       Ens.Get(1183, 'E_SCREEN', parameters={'extended': 0}),
                       'tanimoto') / 100.0)

Category:similarity Category:OpenBabel/Pybel Category:OpenEye/Python Category:Cactvs/Tcl Category:Indigo/C++ Category:Indigo/Python Category:CDK/Groovy Category:Cactvs/Python
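
Toolkit-independent sketch

Most of the fingerprint-based numbers on this page come down to counting shared "on" bits. As a rough illustration that does not depend on any toolkit, the sketch below computes the Tanimoto coefficient c/(a+b-c) and the Dice coefficient 2c/(a+b), which are the two expressions evaluated in the Indigo/C++ listing (Dice is the Tversky index with both weights set to 0.5). The function names and the toy bit positions are invented for this example; they are assumptions, not part of any toolkit's API, and real fingerprints would come from one of the toolkits above.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    if common == 0:
        return 0.0  # also avoids 0/0 when both fingerprints are empty
    return common / (len(a) + len(b) - common)


def dice(bits_a, bits_b):
    """Dice coefficient, i.e. Tversky with both weights equal to 0.5."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    if common == 0:
        return 0.0
    return 2.0 * common / (len(a) + len(b))


# Hypothetical toy fingerprints: bit positions made up for illustration only
fp1 = {1, 4, 7, 9, 12}
fp2 = {4, 7, 12, 15}

print("Tanimoto: %.4f" % tanimoto(fp1, fp2))
print("Dice:     %.4f" % dice(fp1, fp2))
```

Both functions return 0.0 for fingerprints with no bits in common and 1.0 for identical non-empty fingerprints, which matches the range the task asks for. Different toolkits still report different values for the same pair of molecules because they set different bits, not because the coefficient differs.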